Interactive-engagement vs traditional methods: A six-thousand-student 
survey of mechanics test data for introductory physics courses* 

Richard R. Hake 

Department of Physics, Indiana University, Bloomington, Indiana 47405 

A survey of pre/post test data using the Halloun-Hestenes Mechanics Diagnostic test 
or more recent Force Concept Inventory is reported for 62 introductory physics courses 
enrolling a total number of students N = 6542. A consistent analysis over diverse student 
populations in high schools, colleges, and universities is obtained if a rough measure of 
the average effectiveness of a course in promoting conceptual understanding is taken to 
be the average normalized gain <g>. The latter is defined as the ratio of the actual 
average gain (%<post> - %<pre>) to the maximum possible average gain (100 - 
%<pre>). Fourteen "traditional" (T) courses (N = 2084) which made little or no use of 
interactive-engagement (IE) methods achieved an average gain <g>T-ave = 0.23 ± 0.04 
(std dev). In sharp contrast, forty-eight courses (N = 4458) which made substantial use of 
IE methods achieved an average gain <g>IE-ave = 0.48 ± 0.14 (std dev), almost two 
standard deviations of <g>IE-ave above that of the traditional courses. Results for 30 
(N = 3259) of the above 62 courses on the problem-solving Mechanics Baseline test of 
Hestenes-Wells imply that IE strategies enhance problem-solving ability. The conceptual 
and problem-solving test results strongly suggest that the classroom use of IE methods 
can increase mechanics-course effectiveness well beyond that obtained in traditional 
practice. 

I. INTRODUCTION 

There has been considerable recent effort to improve introductory physics courses, especially 
after 1985 when Halloun and Hestenes (ref. 1a) published a careful study using massive pre- and 
post-course testing of students in both calculus and non-calculus-based introductory physics courses 
at Arizona State University. Their conclusions were: (1) "....the student’s initial qualitative, 
common-sense beliefs about motion and.... (its).... causes have a large effect on performance in 
physics, but conventional instruction induces only a small change in those beliefs." 

(2) "Considering the wide differences in the teaching styles of the four professors....(involved in 
the study)....the basic knowledge gain under conventional instruction is essentially independent 
of the professor." These outcomes were consistent with earlier findings of many researchers in 
physics education (see refs. 1 - 8 and citations therein) which suggested that traditional passive- 
student introductory physics courses, even those delivered by the most talented and popular 
instructors, imparted little conceptual understanding of Newtonian mechanics. 

To what extent has the recent effort to improve introductory physics courses succeeded? In 
this article I report a survey of all quantitative pre/post test results known to me (in time to be 
included in this report) which use the original Halloun-Hestenes Mechanics Diagnostic test 
(MD, ref. 1a), the more recent Force Concept Inventory (FCI), and the problem-solving Mechanics 
Baseline test (MB, ref. 10). Both the MD and FCI were designed to be tests of students’ conceptual 
understanding of Newtonian mechanics. One of their outstanding virtues is that the questions 
probe for conceptual understanding of basic concepts of Newtonian mechanics in a way that is 
understandable to the novice who has never taken a physics course, while at the same time 
rigorous enough for the initiate. 

* Accepted for publication in the American Journal of Physics. Comments and criticisms will be 
welcomed at R.R. Hake, 24245 Hatteras St., Woodland Hills, CA, USA 91367, <hake@ix.netcom.com>. 

Most physicists would probably agree that a low score on the FCI/MD test indicates a lack of 
understanding of the basic concepts of mechanics. However, there have been recent con (ref. 11) and 
pro (ref. 12) arguments as to whether a high FCI score indicates the attainment of a unified force 
concept. Nevertheless, even the detractors have conceded that "the FCI is one of the most 
reliable and useful physics tests currently available for introductory physics teachers" and that 
the FCI is "the best test currently available .... to evaluate the effectiveness of instruction in 
introductory physics courses." While waiting for the fulfillment of calls for the development 
of better tests (ref. 11) or better analyses of existing tests, the present survey of 
published and unpublished (ref. 15a,b) classroom results may assist a much needed further 
improvement in introductory mechanics instruction in the light of practical experience. 

II. SURVEY METHOD AND OBJECTIVE 

Starting in 1992, in talks at numerous colloquia and meetings and in e-mail postings on the 
PHYS-L and PhysLrnR nets (ref. 16), I requested that pre/post FCI test data and posttest MB 
data be sent to me. This mode of data solicitation tends to pre-select results which are biased in favor of 
outstanding courses which show relatively high gains on the FCI. When relatively low gains are 
achieved (as they often are) they are sometimes mentioned informally, but they are usually 
neither published nor communicated except by those who (a) wish to use the results from a 
"traditional" course at their institution as a baseline for their own data, or (b) possess unusual 
scientific objectivity and detachment. Fortunately, several in the latter category contributed data 
to the present survey for courses in which interactive engagement methods were used but 
relatively low gains were achieved. Some suggestions (Sec. VII) for increasing course 
effectiveness have been gleaned from those cases. 

Some may think that the present survey presents a negatively biased sampling of traditional 
courses, an attitude which has been known to change after perusal of local FCI test results. It 
should be emphasized that all traditional-course pre/post test data known to me in time to be 
included in this report are displayed in Fig. 1. More such data undoubtedly exist but go 
unreported because the gains are so embarrassingly minimal. 

For survey classification and analysis purposes I define: 

(a) "Interactive Engagement" (IE) methods as those designed at least in part to promote 
conceptual understanding through interactive engagement of students in heads-on 
(always) and hands-on (usually) activities which yield immediate feedback through 
discussion with peers and/or instructors, all as judged by their literature descriptions; 

(b) "Traditional" (T) courses as those reported by instructors to make little or no use of IE 
methods, relying primarily on passive-student lectures, recipe labs, and algorithmic- 
problem exams; 

(c) "Interactive Engagement" (IE) courses as those reported by instructors to make 
substantial use of IE methods; 

(d) average normalized gain <g> for a course as the ratio of the actual average gain <G> to 
the maximum possible average gain, i.e., 

<g> = %<G> / %<G>max 

= (%<Sf> - %<Sj>) / (100 - %<Sj>), (1) 

where <Sf> and <Sj> are the final (post) and initial (pre) class averages; 

(e) "High-g" courses as those with (<g>) ≥ 0.7; 

(f) "Medium-g" courses as those with 0.7 > (<g>) ≥ 0.3; 

(g) "Low-g" courses as those with (<g>) < 0.3. 
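
As a rough illustration of definitions (d) through (g), the following Python sketch computes <g> from class-average scores via Eq. (1) and assigns a gain class. The function names and the example scores are hypothetical and are not taken from the survey; placing the boundary values 0.3 and 0.7 in the upper class is likewise an assumption of this sketch.

```python
def normalized_gain(pre_pct: float, post_pct: float) -> float:
    """Average normalized gain <g> from class-average scores in percent (Eq. 1)."""
    if not 0.0 <= pre_pct < 100.0:
        raise ValueError("pretest average must lie in [0, 100)")
    return (post_pct - pre_pct) / (100.0 - pre_pct)


def gain_class(g: float) -> str:
    """Assign a gain class; boundary values go to the upper class (assumption)."""
    if g >= 0.7:
        return "High-g"
    if g >= 0.3:
        return "Medium-g"
    return "Low-g"


# Hypothetical class averages, chosen only for illustration.
g_example = normalized_gain(pre_pct=40.0, post_pct=70.0)
print(round(g_example, 2), gain_class(g_example))
```

For these illustrative inputs the actual gain is 30 percentage points out of a possible 60, so <g> = 0.5 and the course falls in the Medium-g class.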

The present survey covers 62 introductory courses enrolling a total of 6542 students using the 
conceptual MD or FCI exams, and (where available) the problem-solving Mechanics Baseline 
(MB) test. Survey results for the conceptual and problem-solving exams are presented below in 
the form of graphs. In a companion paper, intended to assist instructors in selecting and 
implementing proven IE methods, I tabulate, discuss, and reference the particular methods and 
materials that were employed in each of the 62 survey courses. Also tabulated in ref. 17a are data 
for each course: instructor’s name and institution, number of students enrolled, pre/post test 
scores, standard deviations where available, and normalized gains. Survey information was 
obtained from published accounts or private communications. The latter usually included 
instructor responses to a survey questionnaire which asked for information on the pre/post 
testing method; statistical results; institution; type of students; activities of the students; and the 
instructor’s educational experience, outlook, beliefs, orientation, resources, and teaching 
methods. 

As in any scientific investigation, bias in the detector can be put to good advantage if 
appropriate research objectives are established. We do not attempt to assess the average 
effectiveness of introductory mechanics courses. Instead we seek to answer a question of 
considerable practical interest to physics teachers: Can the classroom use of IE methods increase 
the effectiveness of introductory mechanics courses well beyond that attained by traditional 
methods? 



III. CONCEPTUAL TEST RESULTS 




Fig. 1. %<Gain> vs %<Pretest> score on the conceptual Mechanics Diagnostic (MD) or Force Concept 
Inventory (FCI) tests for 62 courses enrolling a total N = 6542 students: 14 traditional (T) courses (N = 
2084) which made little or no use of interactive engagement (IE) methods, and 48 IE courses 
(N = 4458) which made considerable use of IE methods. Slope lines for the average of the 14 T courses 
«g»14T and of the 48 IE courses «g»48IE are shown, as explained in the text. 

To increase the statistical reliability (Sec. V) of averages over courses, only those with 
enrollments N > 20 are plotted in Fig. 1, although in some cases of fairly homogeneous 
instruction and student population (AZ-AP, AZ-Reg, PL92-C, TO, TO-C) courses or sections 
with fewer than 20 students were included in a number-of-student-weighted average. Course codes 
such as "AZ-AP" with corresponding enrollments and scores are tabulated and referenced in ref. 
17a. In assessing the FCI, MD, and MB scores it should be kept in mind that the random 
guessing score for each of these five-alternative multiple-choice tests is 20%. However, 
completely non-Newtonian thinkers (if they can at the same time read and comprehend the 
questions) may tend to score below the random guessing level because of the very powerful 
interview-generated distractors (ref. 12a). 

It should be noted that for any particular course point (<G'>, <Sj'>) on the <G> vs <Sj> plot 
of Fig. 1, the absolute value of the slope of a line connecting (<G'>, <Sj'>) with the point 
(<G> = 0, <Sj> = 100) is just the gain parameter <g'> for that particular course. The regularities 
for courses with a wide range of average pretest scores [18 < (<Sj>) < 71] and with diverse 
student populations in high schools, colleges, and universities are noteworthy: 

(a) All points for the 14 T courses (N = 2084) fall in the Low-g region. The data (ref. 17a) yield 

«g»14T = 0.23 ± 0.04sd. (2a) 

Here and below, double carets "«X»NP" indicate an average of averages, i.e., an 
average of <X> over N courses of type P, and sd = standard deviation [not to be confused 
with random or systematic experimental error (Sec. V)]. 

(b) Eighty-five percent (41 courses, N = 3741) of the 48 IE courses fall in the Medium-g 
region and 15% (7 courses, N = 717) in the Low-g region. Overall, the data (ref. 17a) yield 

«g»48IE = 0.48 ± 0.14sd. (2b) 

The slope lines «g» of Eq. (2a, b) are shown in Fig. 1. 

(c) No course points lie in the "High-g" region. 
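
The slope-line construction described above can be sketched in Python as follows. This is only an illustration: the function names and the pretest value of 40% are hypothetical, while the two «g» values are those of Eqs. (2a) and (2b).

```python
def slope_to_corner(pre_pct: float, gain_pct: float) -> float:
    """Slope of the line joining (pretest, gain) to the corner point (100, 0)."""
    return (0.0 - gain_pct) / (100.0 - pre_pct)


def gain_on_slope_line(g: float, pre_pct: float) -> float:
    """Gain (in percent) of a course with normalized gain g at a given pretest."""
    return g * (100.0 - pre_pct)


pre = 40.0  # hypothetical pretest average, percent
for label, g in (("T", 0.23), ("IE", 0.48)):  # averages from Eqs. (2a) and (2b)
    gain = gain_on_slope_line(g, pre)
    # The absolute value of the connecting line's slope recovers g.
    assert abs(abs(slope_to_corner(pre, gain)) - g) < 1e-12
    print(label, round(gain, 1))
```

At this assumed pretest level, a course on the T slope line would gain about 13.8 percentage points and one on the IE slope line about 28.8.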

I infer from features a, b, c that a consistent analysis over diverse student populations with 
widely varying initial knowledge states, as gauged by <Sj>, can be obtained by taking the 
normalized average gain <g> as a rough measure of the effectiveness of a course in promoting 
conceptual understanding. This inference is bolstered by the fact that the correlation of <g> with 
<Sj> for the 62 survey courses is a very low +0.02. In contrast, the average posttest score <Sp> 
and the average gain <G> are less suitable for comparing course effectiveness over diverse 
groups since their correlations with <Sj> are, respectively, +0.55 and -0.49. It should be noted 
that a positive correlation of <Sp> with <Sj> would be expected in the absence of instruction. 

Assuming, then, that <g> is a valid measure of course effectiveness in promoting conceptual 
understanding, it appears that the present interactive engagement courses are, on average, more 
than twice as effective in building basic concepts as traditional courses since «g»48IE = 2.1 
«g»14T. The difference 

«g»48IE - «g»14T = 0.25 (2c) 

is 1.8 standard deviations of «g»48IE and 6.2 standard deviations of «g»14T, reminiscent of 
that seen in comparing instruction delivered to students in large groups with one-on-one 
instruction. 
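
The arithmetic behind this comparison can be checked with a few lines of Python. The sketch uses only the averages and standard deviations quoted in Eqs. (2a) and (2b); the variable names are my own.

```python
g_T, sd_T = 0.23, 0.04    # traditional courses, Eq. (2a)
g_IE, sd_IE = 0.48, 0.14  # interactive-engagement courses, Eq. (2b)

diff = g_IE - g_T    # difference of the two averages of averages
ratio = g_IE / g_T   # relative effectiveness

print(f"difference       = {diff:.2f}")
print(f"ratio            = {ratio:.1f}")
print(f"in units of sd_IE = {diff / sd_IE:.1f}")
print(f"in units of sd_T  = {diff / sd_T:.1f}")
```

Because the T distribution is much narrower than the IE distribution, the same difference corresponds to far more standard deviations when measured against the T spread.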

Fig. 2. Histogram of the average normalized gain <g>: dark (red) bars show the fraction of 14 traditional 
courses (N = 2084), and light (green) bars show the fraction of 48 interactive engagement courses (N = 4458), 
both within bins of width δ<g> = 0.04 centered on the <g> values shown. 

Figure 2 shows the <g>-distribution for traditional (T) and interactive engagement (IE) 
courses plotted in Fig. 1. Both distributions deviate from the symmetric Gaussian shape, but this 
does not invalidate characterization of the spread in the data by the standard deviation. 

The widths of the <g> distributions are evidently related to (a) statistical fluctuations in <g> 
associated with widths of the pre- and posttest score distributions as gauged by their standard 
deviations, plus (b) course-to-course variations in the "systematic errors," plus (c) course-to- 
course variations in the effectiveness of the pedagogy and/or implementation. I use the term 
"systematic errors" to mean that for a single course the errors would affect test scores in a 
systematic way, even though such errors might affect different courses in a more-or-less random 
way. Statistical fluctuations and systematic errors in <g> are discussed below in Sec. V. Case 
studies of the IE courses in the low-end bump of the IE distribution strongly suggest that this 
bump is related to "c" in that various implementation problems are apparent: e.g., insufficient 
training of instructors new to IE methods, failure to communicate to students the nature of 
science and learning, lack of grade incentives for taking IE activities seriously, a paucity of exam 
questions which probe the degree of conceptual understanding induced by the IE methods, and 
use of IE methods in only isolated components of a course. 






B. Gain vs Pretest Graphs for High Schools, Colleges, and Universities 

Figures 3a,b,c show separate G vs Sj plots for the 14 high school (N = 1113), 16 college 
(N = 597), and 32 university courses (N = 4832). Although the enrollment N-weighted average 
pretest scores increase with level [<Sj>HS = 28%, <Sj>C = 39%, <Sj>U = 48% (44% if the 
atypically high Harvard scores are omitted)], in other respects these three plots are all very similar 
to the plot of Fig. 1 for all courses. For high schools, colleges, and universities (a) T courses 
achieve low gains close to the average «g»14T = 0.23; (b) IE courses are about equally 
effective: «g»10IE(HS) = ±0.11sd, «g»13IE(C) = ±0.12sd, and «g»25IE(U) = 
0.45 ± 0.15sd (0.53 ± 0.09sd if the averaging omits the 6 atypical Low-g university courses). 




Fig. 3a. %<Gain> vs %<Pretest> score on the conceptual Mechanics Diagnostic (MD) or Force Concept 
Inventory (FCI) tests for 14 high-school courses enrolling a total of N = 1113 students. In this and 
subsequent figures, course codes, enrollments, and scores are tabulated and referenced in ref. 17a. 



Fig. 3a shows that, for high schools, higher g’s are obtained for honors than for regular 
courses, consistent with the observations of Hestenes et al. The difference between these two 
groups is perceived differently by different instructors and may be school dependent: "the main 
difference is attitude" (ref. 9a); "they differ in their ability to use quantitative representations of data to 
draw conceptual generalizations....motivation is....only part of the difference" (ref. 21); "both sets.... 
(are)....highly motivated....the major differences....(are)....their algebraic skills, the degree of 
confidence in themselves, their ability to pay attention to detail, and their overall ability" (ref. 22). 
Motivational problems can be especially severe for students in IE courses who dislike any 
departure from the traditional methods to which they have become accustomed and under which 




Fig. 3b. %<Gain> vs %<Pretest> score on the conceptual MD or FCI tests for 16 college courses 
enrolling a total of N = 597 students. The course code "-C" indicates a calculus-based course. 

Enrollments for the college courses of Fig. 3b are in the 20 - 61 range so that statistical 
fluctuations associated with "random errors" (Sec. V) could be relatively important. However, the 
variations in <g> for the eleven Monroe Community College courses (M) have been explained 
by Paul D’Alessandris as due to differences in the students or in the instruction: e.g., "With 
regard to the....<g> differences in....the two sections of calculus-based physics in 1995, 
M-PD95b-C....<g> = 0.64....was a night course and M-PD95a-C....<g> = 0.47....was a day course. 
The difference in the student populations between night and day school is the difference between 
night and day. The night students average about 7-10 years older, much more mature and 
dedicated, possibly because they are all paying their own way through school. The actual 
instructional materials and method were the same for both groups. The instructional materials do 
change semester by semester (I hope for the better)....M-PD94-C had <g> = 0.34 (this was the 
first time I used my materials in a calculus-based class.) M-PD95a-C had <g> = 0.47, and in the 
Fall of 1995....not included in this survey because N = 15....I had a <g> of 0.63. This change 
is, hopefully, not a random fluctuation but due to the changes in the workbook. All these were 
day courses." Such tracking of <g> with changes in IE method or implementation, also observed 
at Indiana University, enhances confidence in the use of <g> as a gauge of course effectiveness 
in building basic concepts. 

Fig. 3c. %<Gain> vs %<Pretest> score on the conceptual MD or FCI tests for 32 university courses enrolling 
a total N = 4832 students. The course code "-C" indicates a calculus-based course. 



For university courses (Fig. 3c) six of the IE courses are in the Low-g region - as previously 
indicated, detailed case studies strongly suggest that implementation problems are responsible. 
Ten of the IE courses in the Medium-g region have enrollments over 100 and four have 
enrollments over 200 - OS95-C: 279; EM94-C: 216; IU95S: 209; IU95F: 388. All the N > 200 
courses (refs. 28a, 29a, 30c,d) attempt to bring IE methods to the masses in cost-effective ways by means of 
(a) collaborative peer instruction and (b) employment of undergraduate students to augment 
the instructional staff (Sec. VII). 

The work at Ohio State is part of an ongoing and concerted departmental effort, starting in 
1993, and actively involving about 30% of the faculty. The long-range goal is to induce a 
badly needed (see the point for OS92-C in Fig. 3c) systemic improvement in the effectiveness of 
all the introductory courses. The largest-enrollment introductory physics course at Ohio State, of 
concern here, is designed for engineering students. In this course there is an unusually heavy 
emphasis on "using symbolic language with understanding to solve complex problems." In 
addition to "a" and "b," above, use is made of: (1) Overview Case Studies (OCS), Active 
Learning Problem Sets (ALPS) with context-rich problems, and interactive simulations 
with worksheets; all of these in interactive "lectures" (called "Large Room Meetings"); 
(2) cooperative group problem-solving of context-rich problems and multiple-representation 
exercises in "recitations" (called "Small Room Meetings"); (3) an inquiry approach with 
qualitative questions and experiment problems in the labs. 

Harvard adds Concept Tests, a very complete course Web page, and computer 
communication between and among students and instructors to "a" and "b." 

Indiana University adds to "a" and "b": SDI labs; Concept Tests; 
cooperative group problem-solving in "recitations"; computer communication between 
and among students and instructors; Minute Papers; team teaching; a mid-course 
diagnostic student evaluation over all aspects and components of the course; an academic 
background questionnaire which allows instructors to become personally familiar with the 
aspirations and preparation of each incoming student; a "Physics Forum" staffed by faculty and 
graduate students for 5-8 hours/day where introductory students can find help at any time; 
color coding of displacement, velocity, acceleration, and force vectors in all components 
of the course; and the use of grading acronyms to increase the efficiency of homework 
grading (e.g., NDC = Not Dimensionally Correct). 

IV. MECHANICS BASELINE TEST RESULTS 

The Mechanics Baseline test is designed to measure more quantitative aspects of student 
understanding than the FCI. It is usually given only as a posttest. Figure 4 shows a plot of the 
average percentage score on the problem-solving Mechanics Baseline (MB) posttest vs the 
average percentage score on the FCI posttest for all the available data. The solid line is a 
least-squares fit to the data points. The two scores show an extremely strong positive correlation 
with coefficient r = +0.91. Such a relationship is not unreasonable because the MB test (unlike 
most traditional algorithmic-problem physics exams) requires conceptual understanding in 
addition to some mathematical skill and critical thinking. Thus the MB test is more difficult for 
the average student, as is also indicated by the fact that MB averages tend to be about 15 percentage 
points below FCI averages, i.e., the least-squares-fit line is nearly parallel to the diagonal 
(%MB = %FCI) and about 15 percentage points below it. 
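
To make the least-squares and correlation language concrete, here is a minimal Python sketch of an ordinary least-squares line and Pearson coefficient r. The helper name `fit_line` is my own, and the five (FCI, MB) pairs are hypothetical values invented for illustration; they are not survey data and are not meant to reproduce r = +0.91.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus Pearson r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical course-average posttest scores in percent (NOT survey data).
fci = [45.0, 55.0, 65.0, 75.0, 85.0]
mb = [31.0, 39.0, 51.0, 59.0, 71.0]

slope, intercept, r = fit_line(fci, mb)
print(f"slope = {slope:.2f}, intercept = {intercept:.1f}, r = {r:.3f}")
```

A slope near 1 with a negative intercept of roughly 15 points corresponds to a fitted line that runs nearly parallel to the diagonal and below it, which is the pattern described in the text.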

It is sometimes objected that the problems on the MB test do not sufficiently probe more 
advanced abilities such as those required for problems known as: "context rich"; 
"experiment"; "goal-less"; "out-of-lab"; or Fermi. On the other hand, some 
instructors object that neither the MB problems nor those indicated above are "real" problems 
because they are somewhat different from "Halliday-Resnick problems." Considering the 
differences in outlook, it may be some time before a more widely accepted problem-solving test 
becomes available. 

Fig. 4. Average posttest scores on the problem-solving Mechanics Baseline (MB) test vs those on the 
conceptual FCI test for all courses of this survey for which data are available: thirty courses (high school, 
college, and university) which enroll a total N = 3259 students (ref. 17a). The solid line is a least-squares 
fit to the data points. The dashed line is the diagonal representing equal scores on the MB and FCI tests. 
Courses at Monroe Community College (M) with a "?" designation had non-matching N(MB) > N(FCI) 
because a few students who took the MB did not also take the FCI pretest, as indicated in ref. 17a. If these 
"?" points are excluded from the analyses, then the correlation coefficient "r" changes by less than 0.1% 
and the change in the position of the least-squares-fit line is almost imperceptible on the scale of this figure. 

Figure 4 shows that IE courses generally show both higher FCI averages and higher MB 
averages than traditional courses, especially when the comparison is made for courses with 
similar student populations, e.g., Cal Poly [CP-C vs (CP-RK-Rega-C, CP-RK-Regb-C, and 
CP-RK-Hon-C)]; Harvard (EM90-C vs EM91,93,94,95-C); Monroe Community College (MCC) 
[M93 vs other M-prefix courses]; Arizona high schools [(AZ-Reg & AZ-AP) vs MW-Hon]. 

Thus it would appear that problem-solving capability is actually enhanced (not sacrificed as 
some would believe) when concepts are emphasized. This is consistent with the observations of 
Mazur and with the results of Thacker et al. showing that, at Ohio State, elementary-education 
majors taking an inquiry-based course did better than students enrolled in a 
conventional physics course for engineers on both a synthesis problem and an analysis problem. 

V. ERRORS IN THE NORMALIZED GAIN 

A. Statistical Fluctuations ("Random Errors") 

The widths of the distributions of pre- and posttest scores as characterized by their standard 
deviations (7 to 21% of the total number of questions on the exam) are quite large. In most 
cases these widths are not the result of experimental error but primarily reflect the varying 
characteristics of the students. If a multiplicity of understandings, abilities, skills, and attitudes 
affect test performance and these vary randomly among the students then a near Gaussian 
distribution would be expected for high N. Redish calls this "the individuality or 'linewidth' 
principle." The large linewidths create "random error" uncertainties in the pre- and posttest 
averages and therefore statistical fluctuations ("random errors") Δ<g> in the average normalized 
gains <g>. I have calculated Δ<g>’s in the conventional manner (refs. 45, 46) for the 33 survey courses 
for which standard deviations are available. For this subset: 

«g»T9 = 0.24 ± 0.03sd, (3a) 

«g»IE24 = 0.50 ± 0.12sd, (3b) 

similar to the averages and standard deviations for all the data as indicated in Eq. (2a,b). The 
random error averages <(Δ<g>)> for the subset are 

<(Δ<g>)>T9 = 0.04 ± 0.02sd, (4a) 

<(Δ<g>)>IE24 = ± 0.02sd. (4b) 



According to the usual interpretation, if only random errors are present then the standard 
deviation for an average of averages, Eq. (3), should be about the same as the uncertainty in any one 
average, Eq. (4). (For a numerical example see ref. 45b.) This would suggest that, for the subset, the 
spread (sd = 0.03) in the <g>T distribution can be accounted for primarily by random errors 
[<(Δ<g>)>T9 = 0.04], while the spread (sd = 0.12) in the <g>IE distribution is due to random errors 
[<(Δ<g>)>IE24, Eq. (4b)] plus other factors: course-to-course variation in the systematic error, and 
course-to-course variation in the effectiveness of the pedagogy and/or implementation. 
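
The text does not spell out the calculation of the random error in <g>. One common approach, assumed here and not confirmed as the method of the cited references, is first-order error propagation through Eq. (1), taking the uncertainty of each class average as its standard error (standard deviation divided by the square root of N) and treating the pre- and posttest uncertainties as uncorrelated. The function name and the example inputs below are hypothetical.

```python
from math import sqrt


def gain_random_error(pre_pct, post_pct, sd_pre, sd_post, n_students):
    """First-order propagated random error in <g> = (post - pre) / (100 - pre).

    Assumes uncorrelated pre/post uncertainties, each taken as the standard
    error of the class mean (an assumption of this sketch).
    """
    se_pre = sd_pre / sqrt(n_students)
    se_post = sd_post / sqrt(n_students)
    denom = 100.0 - pre_pct
    dg_dpost = 1.0 / denom                         # partial derivative w.r.t. post
    dg_dpre = (post_pct - 100.0) / denom ** 2      # partial derivative w.r.t. pre
    return sqrt((dg_dpost * se_post) ** 2 + (dg_dpre * se_pre) ** 2)


# Hypothetical course: averages of 40% and 70%, score widths of 15%, N = 100.
delta_g = gain_random_error(40.0, 70.0, sd_pre=15.0, sd_post=15.0, n_students=100)
print(f"delta_g = {delta_g:.3f}")
```

Under these assumptions the random error shrinks as the square root of the enrollment, which is one reason small sections are more vulnerable to statistical fluctuations in <g>.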

B. Systematic Error 

Aside from the previously mentioned controversy (refs. 11, 12) over the interpretation of a high FCI 
score, criticism of FCI testing sometimes involves perceived difficulties such as (1) question 
ambiguities and isolated false positives (right answers for the wrong reasons); and uncontrolled 
variables in the testing conditions such as (2) teaching to the test and test-question leakage, 
(3) the fraction of course time spent on mechanics, (4) post and pretest motivation of students, 
and (5) the Hawthorne/John Henry effects. 

For both IE and T courses, the influence of errors "2" through "5" would be expected to vary 
from course to course in a more or less random manner, resulting in a systematic-error "noise" in 
gain vs pretest plots containing data from many courses. Although the magnitude of this noise is 
difficult to estimate, it contributes to the width of the <g> distributions specified in Eq. (2). The 
analysis of random errors above suggests that the systematic-error noise and the course-to-course 
variations in the effectiveness of the pedagogy and/or implementation contribute more 
importantly to the width of the <g>IE distribution than to the width of the <g>T distribution. 

It is, of course, possible that the systematic errors, even though varying from course to 
course, could, on average, positively bias the IE gains so as to increase the difference 
«g»48IE - «g»14T. I consider below each of the above-indicated systematic errors. 

1. Question Ambiguities and Isolated False Positives 

The use of a revised version of the FCI with fewer ambiguities and a smaller likelihood of 
false positives has had little impact on <g>IE as measured at Indiana and Harvard 
Universities. In addition, (a) interview data (refs. 9a, 12a) suggest that ambiguities and false positives are 
relatively rare, (b) these errors would be expected to bias the IE and T courses about equally and 
therefore have little influence on the difference «g»48IE - «g»14T. 

2. Teaching to the Test and Test-question Leakage. 

Considering the elemental nature of the FCI questions, for IE courses both the average 
«g»48IE = 0.48 and the maximum <g> = 0.69 are disappointingly low, and below those 
which might be expected if teaching to the test or test-question leakage were important 
influences. 

Of the 48 data sets (ref. 17a) for IE courses (a) 27 were supplied by respondents to our requests for 
data, of which 22 (81%) were accompanied by a completed survey questionnaire, (b) 13 have 
been discussed in the literature, and (c) 5 are Indiana University courses of which I have first- 
hand knowledge. All survey-form respondents indicated that they thought they had avoided 
"teaching to the test" in answering the question "To what extent do you think you were able to 
avoid ’teaching to the test(s)’ (i.e., going over experiments, questions, or problems identical or 
nearly identical to the test items)?" Likewise, published reports of the courses in group "b" and 
my own knowledge of courses in group "c" suggest an absence of "teaching to the test" in the 
restricted sense indicated in the question. (In the broadest sense, IE courses all "teach to the test" 
to some extent if this means teaching so as to give students some understanding of the basic 
concepts of Newtonian mechanics as examined on the FCI/MD tests. However this is the bias 
we are attempting to measure.) 

There has been no evidence of test-question leakage in the Indiana posttest results (e.g., 
significant mismatches for individual students between FCI scores and other course grades). So 
far there has been only one report of such leakage in the literature - as indicated in ref. 17a, the 
suspect data were excised from the survey. 

3. Fraction of Course Time Spent on Mechanics 

Comparisons can be made for T and IE courses within the same institution where the 
ratio f = tm/ts of the class time tm spent on mechanics (including energy and momentum 
conservation) to the total semester (or semester-equivalent) time ts is about the same: 

Arizona State (f = 0.8): «g»IE2 - «g»T3 = 0.23; 

Cal Poly (f = 1.0): «g»IE3 - «g»T1 = 0.56 - 0.25 = 0.31; 

Harvard (f = 0.6): «g»IE4 - «g»T1 = 0.29; 

Monroe Com. Coll. (MCC), non-calc (f = 0.8): «g»IE4 - «g»T1 = 0.33; 

MCC, calculus (f = 1.0): «g»IE4 - «g»T1 = 0.25 (with «g»T1 = 0.22); and 

Ohio State (f = 0.7): «g»IE1 - «g»T1 = 0.24 (with «g»T1 = 0.18). 
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Thus a substantial difference «g»iE - «g»T maintained where the time factor is 
equal. 

That the gain difference is not very sensitive to the fraction of the course time spent on
mechanics over the range common in introductory courses can also be seen from the fact that
the differences quoted above are rather similar to (a) one another, despite the differences in f, and
(b) the difference «g»IE48 - «g»T14 = 0.25, which characterizes the entire survey, despite
the fact that f varies among the survey courses. Questionnaire responses covering 22 of the
survey courses indicated that f ranged from 0.7 to 1.0 with an average of 0.9 ± 0.1sd.
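
The similarity of the six within-institution differences can be checked with a few lines of arithmetic. The following Python sketch uses only the six differences and f values quoted above; the variable names are illustrative assumptions and not part of the survey.

```python
from statistics import mean, stdev

# Within-institution differences <<g>>IE - <<g>>T quoted in the text,
# keyed by (institution, fraction f of course time spent on mechanics).
differences = {
    ("Arizona State", 0.8): 0.23,
    ("Cal Poly", 1.0): 0.31,
    ("Harvard", 0.6): 0.29,
    ("MCC, non-calc", 0.8): 0.33,
    ("MCC, calculus", 1.0): 0.25,
    ("Ohio State", 0.7): 0.24,
}

values = list(differences.values())
avg_diff = mean(values)   # average of the six differences
spread = stdev(values)    # sample standard deviation of the six differences

print(f"mean difference = {avg_diff:.3f}, sample sd = {spread:.3f}")
```

Under these inputs the mean difference is a little under 0.28 and the sample standard deviation is roughly 0.04, small compared with the differences themselves even though f ranges from 0.6 to 1.0.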

4. Post and Pretest Motivation of Students 

As indicated in "2" above, of the 48 data sets for IE courses, 27 were supplied by
respondents to our requests for data, of which 22 were accompanied by a completed survey
questionnaire. Responses to the question "Did the FCI posttest count as part of the final grade in
your course? If so give the approximate weighting factor" were: "No" (50% of the 22 courses
surveyed); "Not usually" (9%); "Yes, about 5%" (23%); "Yes, weighting factor under 10%"
(9%); No Response (9%). For the 11 courses for which no grade incentives were offered,
«g»IE11 = 0.49 ± 0.10sd, close to the average <g> for all the 48 IE courses of the survey,
«g»IE48 = 0.48 ± 0.14sd. Thus it seems doubtful that posttest grade-incentive motivation is
a significant factor in determining the normalized gain.

As for the pretest, grade credit is, of course, inappropriate, but <g> can be artificially raised if
students are not induced (ref. 49) to take the pretest seriously. All surveyed instructors answered
"Yes" to the survey form question "Do you think that your students exerted serious effort on the
FCI pretest?" Likewise, published reports of the courses not surveyed and my own knowledge
of courses at Indiana suggest that students did take the pretest seriously.
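
A small numerical sketch may help show why a carelessly taken pretest inflates <g>. It uses the definition of the average normalized gain given in the abstract, <g> = (%<post> - %<pre>) / (100 - %<pre>). The class averages below are hypothetical values chosen for illustration, not survey data, and the function name is an assumption.

```python
def normalized_gain(pre_percent: float, post_percent: float) -> float:
    """Average normalized gain <g> = (%post - %pre) / (100 - %pre)."""
    if not 0 <= pre_percent < 100:
        raise ValueError("pretest average must be in [0, 100)")
    return (post_percent - pre_percent) / (100.0 - pre_percent)

# Hypothetical class averages (illustrative only, not survey data).
serious_pre = 40.0   # pretest average when students exert serious effort
careless_pre = 30.0  # same class, pretest average depressed by lack of effort
post = 70.0          # posttest average, identical in both scenarios

g_serious = normalized_gain(serious_pre, post)    # 30 / 60
g_careless = normalized_gain(careless_pre, post)  # 40 / 70

print(f"<g> with serious pretest:  {g_serious:.2f}")
print(f"<g> with careless pretest: {g_careless:.2f}")
```

In this made-up case the same posttest performance yields a larger <g> when the pretest average is artificially low, which is why serious pretest effort matters.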

5. Hawthorne/John Henry Effects

These effects can produce short-term benefits associated with (a) the special attention (rather 
than the intrinsic worth of the treatment) given to a research test group (Hawthorne effect), or (b) 
the desire of a control group to exceed the performance of a competing test group (John Henry 
effect). Such benefits would be expected to diminish when the treatment is applied as a regular 
long-term routine to large numbers of subjects. Among IE courses, Hawthorne effects should be
relatively small for courses where IE methods have been employed for many years in regular
instruction for hundreds of students: five 1994-5 courses at Monroe Community College
(N = 169); four 1993-5 courses at Indiana University (N = 917; ref. 34); and three 1993-5 courses at
Harvard (N = 560). For these 12 courses «g»IE12 = 0.54 ± 0.10sd, about the same as the
«g»IE29 = 0.51 ± 0.10sd average of the 29 IE courses (excluding the 7 atypical Low-g
courses) for which, on average, Hawthorne effects were more likely to have occurred. Students
may well benefit from the special attention paid to them in regular IE instruction over the long
term, but this benefit is intrinsic to the pedagogy and should not be classed as a Hawthorne
effect. I shall not consider John Henry effects because any correction for them would only
decrease «g»T14 and thus increase the difference «g»IE48 - «g»T14.

Although no reliable quantitative estimate of the influence of systematic errors seems 
possible under the present survey conditions, arguments in "1" to "5" above, and the general 
uniformity of the survey results, suggest that it is extremely unlikely that systematic error plays a 
significant role in the nearly two-standard-deviation difference observed in the average
normalized gains of T and IE courses shown in Eq. (2c) and in Fig. 1. Thus we conclude that
this difference primarily reflects variation in the effectiveness of the pedagogy and/or
implementation.

VI. IMPACT OF PHYSICS-EDUCATION RESEARCH 

All interactive-engagement methods used in the survey courses were stimulated in one way
or another by physics-education research (PER) and cognitive science. It is
significant that of the 12 IE courses (refs. 9a,c; 21; 27; 29; 30b,c,d; 54-56) that achieved
normalized gains g > 0.60 (see Figs. 1, 3), 67% were taught at least in part by individuals who
had devoted considerable attention to PER, as judged by their publication of peer-reviewed
articles or books on that subject [the same can be said for 48% of the 36 IE courses with
(<g>) < 0.6]. It is also noteworthy that of the 12 IE courses with g > 0.60, 42% utilized texts
based on PER [the same can be said for 19% of the 36 IE courses with (<g>) < 0.6]. It would
thus appear that PER has produced very positive results in the classroom.

For the 48 interactive-engagement courses of Fig. 1, the ranking in terms of number of IE
courses using each of the more popular methods is: Collaborative Peer Instruction (CPI): 48
(all courses); Microcomputer-Based Labs (MBL): 35; Concept Tests (ref. 29): 20; Modeling:
19; Active Learning Problem Sets (ALPS) or Overview Case Studies (OCS) (ref. 28b): 17;
physics-education-research based text or no text: 13; and Socratic Dialogue Inducing (SDI)
Labs: 9. [For simplicity, courses combined into one "course" [TO (8 courses), TO-C (5 courses),
and IUpre93 (5 courses)] are counted as one course each.] The ranking in terms of number of
students using each method is: CPI: 4458 (all students); MBL: 2704; Concept Tests: 2479; SDI:
1705; OCS/ALPS: 1101; Modeling: 885; research-based text or no text: 660.
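
The two rankings can be compared side by side with a short tabulation. The sketch below simply re-enters the course and student counts quoted above; the dictionary layout and names are illustrative assumptions.

```python
# Number of IE courses and students using each method, as quoted in the text.
usage = {
    "CPI": (48, 4458),
    "MBL": (35, 2704),
    "Concept Tests": (20, 2479),
    "Modeling": (19, 885),
    "ALPS/OCS": (17, 1101),
    "PER-based text or no text": (13, 660),
    "SDI Labs": (9, 1705),
}

# CPI was used in all IE courses, so its counts double as the survey totals.
TOTAL_COURSES, TOTAL_STUDENTS = usage["CPI"]

by_courses = sorted(usage, key=lambda m: usage[m][0], reverse=True)
by_students = sorted(usage, key=lambda m: usage[m][1], reverse=True)

for method in by_students:
    courses, students = usage[method]
    print(f"{method:26s} {courses:2d} courses ({courses / TOTAL_COURSES:4.0%}), "
          f"{students:4d} students ({students / TOTAL_STUDENTS:4.0%})")
```

The tabulation makes one feature of the quoted numbers easy to see: SDI Labs rank last by number of courses but fourth by number of students, so the methods used in a few large courses are underweighted in a course-count ranking.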

A detailed breakdown of the instructional strategies as well as materials and their sources for
each of the 48 IE courses of this survey is presented in a companion article. The IE methods
are usually interdependent and can be melded together to enhance one another's strengths and
modified to suit local conditions and preferences (especially easy if materials are available
electronically (refs. 27a, 29c, 34c) so as to facilitate copying, pasting, and cutting). All these IE
strategies, having proven themselves to be relatively effective in large-scale pre/post testing,
deserve serious consideration by physics teachers who wish to improve their courses, by
physics-education researchers, and by designers of new introductory physics courses.

VII. SUGGESTIONS FOR COURSE AND SURVEY IMPROVEMENTS 

Although the 48 interactive-engagement courses of Figs. 1-3 appear, on average, to be much 
more effective than traditional courses, none is in the High-g region and some are even in the 
Low-g region characteristic of traditional courses. This is especially disturbing considering the 
elemental and basic nature of the Force Concept Inventory and Mechanics Diagnostic test 
questions. (Many instructors refuse to place such questions on their exams, thinking that they are
"too simple.") As indicated above, case studies of the Low-g IE courses strongly suggest
the presence of implementation problems. Similar detailed studies for Medium-g IE courses 
were not carried out, but personal experience with the Indiana courses and communications with 
most of the IE instructors in this study suggest that similar though less severe implementation 
problems (Sec. IIIA) were common.

Thus there appear to be no magic bullets among the IE treatments of this survey, and more
work seems to be required on both their content and implementation. As argued more
trenchantly in ref. 17a, this survey and other work suggest that improvements may occur
through, e.g., (a) use of IE methods in all components of a course and tight integration of all
those components; (b) careful attention to motivational factors and the provision of grade
incentives for taking IE activities seriously; (c) administration of exams in which a substantial
number of the questions probe the degree of conceptual understanding induced by the IE
methods; (d) inexpensive augmentation of the teaching/coaching staff by undergraduate and
postdoctoral students; (e) apprenticeship education of instructors new to IE methods;
(f) early recognition and positive intervention for potential low-gain students; (g) explicit focus
on the goals and methods of science (including an emphasis on operational
definitions); (h) more personal attention to students by means of human-mediated
computer instruction in some areas; (i) new types of courses; and (j) advances in
physics-education research and cognitive science. More generally, a redesign process (described by
Wilson and Daviss and undertaken in refs. 34 and 67) of continuous long-term classroom use,
feedback, assessment, research analysis, and revision seems to be required for substantive
educational reform.

Standards and measurement are badly needed in physics education and are vital
components of the redesign process. In my view, the present survey is a step in the right
direction, but improvements in future assessments might be achieved through (in approximate
order of ease of implementation): (1) standardization of test-administration practices; (2) use
of a survey questionnaire refined and sharpened in light of the present experience; (3) more
widespread use of standardized tests by individual instructors so as to monitor the
learning of their students; (4) observation and analysis of classroom activities by independent
evaluators; (5) solicitation of anonymous information from a large random sample of physics
teachers; (6) development and use of new and improved versions of the FCI and MB tests,
treated with the confidentiality of the MCAT; (7) use of E&M concept tests and
questionnaires which assess student views on science and learning; and (8) reduction of
possible teaching-to-the-test influence by drawing test questions from pools such that the specific
questions are unknown to the instructor.

VIII. SUMMARY AND CONCLUSION

Fourteen traditional (T) courses (N = 2084) which made little or no use of interactive-
engagement (IE) methods achieved an average gain «g»T14 = 0.23 ± 0.04. In sharp contrast,
forty-eight IE courses (N = 4458) which made substantial use of IE methods achieved an average
gain «g»IE48 = 0.48 ± 0.14. It is extremely unlikely that systematic errors play a significant
role in the nearly two-standard-deviation difference in the normalized gains of the T and IE
courses.
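
The phrase "nearly two-standard-deviation difference" can be made concrete with the averages quoted in this paper. The following Python sketch expresses the T/IE difference in units of the IE standard deviation, as the abstract does; the variable names are illustrative assumptions.

```python
# Survey averages quoted in the text (normalized gain, standard deviation).
g_T, sd_T = 0.23, 0.04     # 14 traditional courses
g_IE, sd_IE = 0.48, 0.14   # 48 interactive-engagement courses

difference = g_IE - g_T                   # gap between the two averages
in_units_of_sd_IE = difference / sd_IE    # gap measured in IE standard deviations

print(f"difference = {difference:.2f}")
print(f"difference / sd_IE = {in_units_of_sd_IE:.1f}")
```

With these inputs the gap is 0.25, or about 1.8 standard deviations of the IE distribution, consistent with "nearly two."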

A plot of average course scores on the Hestenes-Wells problem-solving Mechanics Baseline
test versus those on the conceptual Force Concept Inventory shows a strong positive correlation
with coefficient r = +0.91. Comparison of IE and traditional courses implies that IE methods
enhance problem-solving ability.

The conceptual and problem-solving test results strongly suggest that the use of IE strategies 
can increase mechanics-course effectiveness well beyond that obtained with traditional methods. 

Epilogue 

This survey indicates that the strenuous recent efforts to reform introductory physics
instruction, enlightened by cognitive science and research in physics education, have shown very
positive results in the classroom. However, history (refs. 70-75) suggests the possibility that such
efforts may have little lasting impact. This would be most unfortunate, considering the current
imperative to (a) educate more effective science majors (ref. 76) and science-trained professionals,
and (b) raise the appallingly low level of science literacy among the general population.
Progress towards these goals should increase our chances of solving the monumental science-
intensive problems (economic, social, political, and environmental) that beset us, but major
upgrading of physics education on a national scale will probably require (1) the cooperation of
instructors, departments, institutions, and professional organizations, and (2) long-term classroom
use, feedback, assessment, research analysis, and redesign of interactive-engagement methods.